-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: read_excel for ods files raising UnboundLocalError in certain cases #36175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/io/excel/_odfreader.py
Outdated
super().__init__(filepath_or_buffer, storage_options=storage_options) | ||
|
||
def _monkeypatch_odf_element_str(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
woa, why don't you just change the impl to be correct? we never want to monkeypatch things in library code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that we should not be monkeypatching the library code and ideally this should be fixed in odfpy directly.
If you look at the fix it essentially does the exact thing that the library does https://github.com/eea/odfpy/blob/master/odf/element.py#L240-L244 except add the case that PR #33233 addresses. This is relevant mostly because these Nodes can get specially if the nested child nodes also contain multiple spaces that #33233 addresses. Then it would become a recursive implementation.
For a cell like this:
there is no way to get all the spaces without recursive (or equivalent). The original PR is also somewhat of a monkeypatch as it essentially replaced str(cell)
with _get_cell_string_value(cell)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right why can't you just change _get_string_cell_value to what you are doing here? (and put a nice comment / link to the source & issue)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it. done
not sure why the tests are failing. they seem unrelated.
Looks good - can you add a what’s new note for 1.2 bug fix section? |
@WillAyd added what's new note |
# https://github.com/pandas-dev/pandas/pull/36175#discussion_r484639704 | ||
value.append(self._get_cell_string_value(fragment)) | ||
else: | ||
value.append(str(fragment)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any reason to call str(fragment) instead of always just calling self._get_cell_string_value(fragment)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then we'd need to handle the str
method for all the different cases which would require going into the spec. odfpy
already has the str implementations done. the only case that was problematic was the multiple spaces that was addressed in the previous linked PR.
value.append(" " * spaces) | ||
else: | ||
# recursive impl needed in case of nested fragments |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a reader it isn't clear to me what this means; is this a bug that needs to be fixed upstream?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
before #33233, the value was str(cell)
which relied on the upstream odfpy implementation.
#33233 attempted to fix cases where multiple spaces was getting skipped in the output. That is because odfpy's str implementations for Nodes/Elements that are of that specific type (from odf.text import S
) does not include the actual number of spaces. This imo should have been addressed upstream.
#33233 also introduced the bug that is causing the current UnboundLocalError as one line is misaligned. But fixing the misalignment adds in more bugs. With just the indentation fix, the output will still be missing certain fragments from the cell (for the file in #36122).
This was my original reason for doing a monkeypatch that patched __str__
implementation of the Element
nodes to include the S
spaces Node. I changed the monkeypatch but instead changed the implementation of _get_cell_string_value
to have the same behavior as the monkeypatch. It still relies on odfpy
's str implementation for all other Element types except for the Space Node, otherwise it pretty much mirrors Element.__str__
from odfpy.
Sorry for the wall of text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comment, but lgtm.
doc/source/whatsnew/v1.2.0.rst
Outdated
@@ -296,6 +296,7 @@ I/O | |||
- :meth:`to_csv` did not support zip compression for binary file object not having a filename (:issue: `35058`) | |||
- :meth:`to_csv` and :meth:`read_csv` did not honor `compression` and `encoding` for path-like objects that are internally converted to file-like objects (:issue:`35677`, :issue:`26124`, and :issue:`32392`) | |||
- :meth:`to_picke` and :meth:`read_pickle` did not support compression for file-objects (:issue:`26237`, :issue:`29054`, and :issue:`29570`) | |||
- Bug in :meth:`read_excel` with `engine="odf"` caused UnboundLocalError in some cases where cells had nested child nodes (:issue:`36122`, and :issue:`35802`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use double back ticks on UnboundLocalError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
thanks @asishm |
from #36122 (comment)
could this fix go in 1.1.3 |
yep could backport |
ok will raise PR on master first to move release note and then do a manual backport. can't auto backport this (followed by release note move) as no doc/source/whatsnew/v1.2.0.rst on 1.1.x |
second thoughts will do the backport first and then if all ok move release note on master to be in sync |
…nboundLocalError in certain cases
…lError in certain cases (#36355) Co-authored-by: Asish Mahapatra <[email protected]>
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Need some guidance on tests here. Do we need tests for other file formats here as well? Also unsure about naming conventions.
The primary bug in this case (apart from the indentation problem in the original code) is that the code was ignoring cases where the XML of the cell had multiple child nodes that were not spaces
reverted implementation from #33233 incorporating part of its modifications.